9.2 - How Action Masking Works
Action Masking is an architectural pattern that solves the Invalid Action Problem by injecting domain knowledge directly into the agent's decision-making process. Instead of letting the agent explore the entire theoretical action space, the environment provides a "mask" on each step that restricts the agent's choices to only those that are currently valid.
The Core Concept: Pre-filtering, Not Post-correction
The mask is not used to correct an agent's mistake. Rather, it is provided as part of the observation, allowing the agent's policy network to completely ignore invalid actions from the outset.
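To make the idea concrete, here is a minimal, library-agnostic sketch of the pre-filtering step: invalid actions have their logits pushed to negative infinity before the softmax, so their sampling probability is exactly zero. The logit and mask values are made up for illustration.

```python
import numpy as np

def masked_action_probabilities(logits: np.ndarray, action_mask: np.ndarray) -> np.ndarray:
    """Zero out invalid actions by masking the logits before the softmax."""
    valid = action_mask.astype(bool)
    masked_logits = np.where(valid, logits, -np.inf)            # invalid actions -> -inf
    exps = np.exp(masked_logits - masked_logits[valid].max())   # numerically stable softmax
    return exps / exps.sum()

logits = np.array([1.2, 0.3, -0.5, 2.0])   # raw policy network output
action_mask = np.array([1, 0, 1, 0])       # actions 1 and 3 are currently invalid
probs = masked_action_probabilities(logits, action_mask)
print(probs)  # masked positions have probability 0, so they can never be sampled
```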
The Right Tool: `sb3-contrib` and `MaskablePPO`
The `stable-baselines3` ecosystem provides a purpose-built solution for this in the `sb3-contrib` package.
- `sb3-contrib`: A companion library with experimental and advanced algorithms.
- `MaskablePPO`: A version of PPO specifically designed to accept and utilize an action mask during training. Using it is the standard, best-practice approach.
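For orientation, the snippet below follows the usage pattern from the `sb3-contrib` documentation, using the library's built-in `InvalidActionEnvDiscrete` toy environment rather than this project's environment; the environment parameters and timestep count are illustrative only.

```python
from sb3_contrib import MaskablePPO
from sb3_contrib.common.envs import InvalidActionEnvDiscrete

# Toy environment shipped with sb3-contrib: a discrete action space in which
# a number of actions are flagged as invalid at every step.
env = InvalidActionEnvDiscrete(dim=80, n_invalid_actions=60)

# MaskablePPO trains like regular PPO, but respects the environment's action mask.
model = MaskablePPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)
```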
Architectural Changes Required
Implementing action masking requires a specific modification to the environment's API and the data flow between the environment and the agent.
| Component | "Before" (No Masking) | "After" (With Masking) | Rationale |
|---|---|---|---|
| Observation Space | `gym.spaces.Box` | `gym.spaces.Dict` | The observation must now be a dictionary containing both the original state and the action mask. |
| Observation Data | A single `numpy.ndarray` | A dictionary: `{"obs": ndarray, "action_mask": ndarray}` | The dictionary structure is the standard way to pass multiple, distinct pieces of information to the agent. |
| RL Algorithm | `stable_baselines3.PPO` | `sb3_contrib.MaskablePPO` | `MaskablePPO` is designed to accept the action mask supplied by the environment and uses it to constrain its policy output to valid actions. |
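The sketch below shows what the "After" column looks like in code. The shapes and bounds are placeholder assumptions; the real values depend on `DRL_RM_Env`'s feature matrix and action space.

```python
import numpy as np
import gym  # or `gymnasium`, depending on the project's dependency stack

# Placeholder dimensions, assumed for illustration only.
N_UNITS, N_FEATURES, N_ACTIONS = 64, 8, 12

observation_space = gym.spaces.Dict({
    "obs": gym.spaces.Box(low=-1.0, high=1.0, shape=(N_UNITS, N_FEATURES), dtype=np.float32),
    "action_mask": gym.spaces.Box(low=0, high=1, shape=(N_ACTIONS,), dtype=np.int8),
})

# A single observation is then a plain dictionary carrying both pieces of information.
observation = {
    "obs": np.zeros((N_UNITS, N_FEATURES), dtype=np.float32),
    "action_mask": np.ones(N_ACTIONS, dtype=np.int8),
}
```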
The Action Masking Workflow
This diagram illustrates the new data flow with action masking implemented.
+--------------------------+
| 1. Environment State |
| (Game Step N) |
+--------------------------+
|
+----------------------------------+
| |
v v
+--------------------------+ +--------------------------+
| 2. Generate State Obs | | 3. Generate Action Mask |
| (The unit feature matrix)| | (The boolean vector) |
+--------------------------+ +--------------------------+
| |
| |
v v
+-------------------------------------------------------------+
| 4. Combine into a Dictionary Observation |
| (e.g., {"obs": obs_data, "action_mask": mask_data}) |
+-------------------------------------------------------------+
|
v
+-------------------------------------------------------------+
| 5. `MaskablePPO` Agent receives the dictionary. It |
| applies the mask to its neural network output. |
+-------------------------------------------------------------+
|
v
+----------------------------------------------------------------+
| 6. Agent samples a new action that is guaranteed to be valid. |
+----------------------------------------------------------------+
This architecture dramatically prunes the agent's search space, which is an essential step for making a complex, decomposed action space tractable for learning.
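A minimal sketch of steps 2 through 4 of the workflow follows. The helper functions are hypothetical stand-ins for `DRL_RM_Env`'s real feature extraction and validity checks; they only illustrate how the two outputs are combined into one dictionary observation.

```python
import numpy as np

N_UNITS, N_FEATURES, N_ACTIONS = 64, 8, 12  # placeholder sizes

def build_state_obs(game_state) -> np.ndarray:
    """Step 2 (stand-in): produce the unit feature matrix from the game state."""
    return np.zeros((N_UNITS, N_FEATURES), dtype=np.float32)

def build_action_mask(game_state) -> np.ndarray:
    """Step 3 (stand-in): produce the boolean validity vector for the current step."""
    return np.ones(N_ACTIONS, dtype=np.int8)

def get_observation(game_state) -> dict:
    """Step 4: combine the state and the mask into the dictionary observation."""
    return {
        "obs": build_state_obs(game_state),
        "action_mask": build_action_mask(game_state),
    }
```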
Implementation Plan
- Step 1: Install sb3-contrib (`pip install sb3-contrib`).
- Step 2: Modify `train.py` to import and use `MaskablePPO`.
- Step 3: Modify `DRL_RM_Env` to use a `gym.spaces.Dict` observation space.
- Step 4: Implement the mask generation logic inside the `DRL_RM_Bot`'s `on_step` method (a minimal sketch follows this list).
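As a preview of Step 4, the sketch below shows the general shape of mask generation from live game conditions. The action indices and the specific resource checks are hypothetical placeholders, not the actual `DRL_RM_Bot` rules.

```python
import numpy as np

N_ACTIONS = 12  # assumed size of the discrete action space

def generate_action_mask(minerals: int, supply_left: int, has_production: bool) -> np.ndarray:
    """Mark each discrete action as valid (1) or invalid (0) for the current step."""
    mask = np.ones(N_ACTIONS, dtype=np.int8)
    if minerals < 100:
        mask[1] = 0   # hypothetical: action 1 builds a structure costing 100 minerals
    if not has_production or minerals < 50:
        mask[2] = 0   # hypothetical: action 2 trains a unit needing a production building
    if supply_left <= 0:
        mask[2] = 0   # no unit training when supply-blocked
    return mask
```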
The next section will provide the full code for these changes.